{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc = pyspark.SparkContext()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 生成100万个字母组成的随机列表"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = [chr(random.randint(65,91)) for i in range(1000000)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 生成利用转换算子parallelize将列表转换为RDD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 生成了一个管道集合RDD，顾名思义，只是生成一个管道，而不是实体"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0:00:00.222304\n"
     ]
    }
   ],
   "source": [
    "s = datetime.datetime.now()\n",
    "rdd = sc.parallelize(x)\n",
    "print(datetime.datetime.now() - s)\n",
    "print(rdd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 执行count这个行动算子，开始提交任务，所以整体执行时间需要5秒"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0:00:05.061927\n",
      "1000000\n"
     ]
    }
   ],
   "source": [
    "s = datetime.datetime.now()\n",
    "x2 = rdd.count()\n",
    "print(datetime.datetime.now() - s)\n",
    "print(x2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 在rdd算子上面执行一个转换算子map，使每个字母进行一个计数，实际上就是一个转换过程：\n",
    "### 集合（\"w\"） ——> 集合(\"w\",1) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## map算子作为一个转换算子，仅仅记录一个函数过程链，而不发生实际的操作，所以完全没有花费任何执行时间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0:00:00\n",
      "PythonRDD[4] at RDD at PythonRDD.scala:48\n"
     ]
    }
   ],
   "source": [
    "s = datetime.datetime.now()\n",
    "rdd2 = rdd.map(lambda x : (x,1))\n",
    "print(datetime.datetime.now() - s)\n",
    "print(rdd2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 同上，reduceByKey算子，也是一个转换算子，转换的结果是得到一个同key相交的结果，如："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ((\"w\",1),(\"w\",1),(\"w\",1),(\"z\",1),(\"z\",1))  ——>\n",
    "### ((\"w\",3),(\"z\",2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 同样，reduceByKey算子也不会触发执行，所以也没有花费执行时间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0:00:00.034910\n",
      "PythonRDD[15] at RDD at PythonRDD.scala:48\n"
     ]
    }
   ],
   "source": [
    "s = datetime.datetime.now()\n",
    "rdd3 = rdd2.reduceByKey(lambda a,b : a+b)\n",
    "print(datetime.datetime.now() - s)\n",
    "print(rdd3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 最后调用collect算子，这是一个action算子，开始执行全环节运行，全部函数链执行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0:00:09.629967\n",
      "[('[', 36920), ('O', 36954), ('N', 37148), ('J', 36862), ('B', 36882), ('H', 36955), ('W', 37045), ('C', 37055), ('R', 36978), ('X', 37271), ('V', 36881), ('Z', 37272), ('I', 36911), ('K', 37436), ('L', 36595), ('A', 36775), ('T', 37037), ('M', 37113), ('Q', 36969), ('U', 37495), ('F', 37137), ('S', 37112), ('G', 36919), ('Y', 36865), ('D', 37267), ('P', 36932), ('E', 37214)]\n"
     ]
    }
   ],
   "source": [
    "s = datetime.datetime.now()\n",
    "res = rdd3.collect()\n",
    "print(datetime.datetime.now() - s)\n",
    "print(res)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
